Capstone Project: The Battle of the Neighborhoods

Introduction

My friends living in Cologne, Germany wish to relocate for professional reasons to Berlin, Germany. They reached out to me for recommendations on the Berlin suburbs. Rental prices are not their priority. They mentioned metrics like:

They know such a decision is very subjective, but since I have also lived in Cologne and know its suburbs quite well, they asked me whether I could make a mapping of Cologne suburbs to Berlin suburbs. Since they are familiar with the Cologne suburbs, such a mapping would give them a first impression of the Berlin suburbs and make it easier to reach a decision tailored to their needs.

Data needed for this project

In order to tackle such a problem, I collected the following data:

  1. Official names of the Berlin and Cologne suburbs were web-scraped. Representative nodes of the suburbs were queried from the OpenStreetMap API. Data were corrected to reflect the officially recognized suburbs and their names. At the end of this data extraction and transformation, I had all officially recognized suburbs of Berlin and Cologne represented as nodes in WGS (latitude/longitude) coordinates.

  2. The Foursquare API was queried around each representative suburb node using a radius that was determined separately for Berlin and Cologne. I wanted to capture enough of each suburb's character while allowing queries from neighboring suburbs to overlap, so that the suburb boundaries blend. The reason is that the reality on the ground is not influenced in any way by administrative boundaries. A neighborhood can evolve across suburb boundaries and maintain its character. In such a case, if most of the neighborhood lies on one side of the boundary, its similarity with the neighboring suburb will be missed unless the queries are allowed to overlap.

  3. For the coordinate reference system (CRS) transformation from World Geodetic System (WGS) latitude/longitude to Universal Transverse Mercator (UTM) Cartesian coordinates I used EPSG:5243, which is appropriate for Germany.

Methodology

Once the correct list of suburbs was acquired and geolocalized, I visualized the results with folium for a final inspection. The next task was to determine the Foursquare query radius to use for each city. For that, I identified the nearest neighbor of each suburb and computed the corresponding Euclidean distance. I then computed percentile statistics and visualized the distribution of distances for each city separately for inspection. In the end, I settled on radii in the range of the lower 10th percentile of nearest-neighbor suburb distances in each city.
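The nearest-neighbor step can be illustrated with a minimal sketch. It assumes the suburb nodes have already been projected to Cartesian coordinates in meters; the coordinates and the names `nearest_neighbor_distances` and `suburb_nodes` are hypothetical and are not taken from the original analysis.

```python
import math
import statistics


def nearest_neighbor_distances(points):
    """For each (x, y) point in meters, return the distance to its closest other point."""
    distances = []
    for i, (xi, yi) in enumerate(points):
        nearest = min(
            math.hypot(xi - xj, yi - yj)
            for j, (xj, yj) in enumerate(points)
            if j != i
        )
        distances.append(nearest)
    return distances


# Hypothetical projected coordinates (meters) of four suburb nodes.
suburb_nodes = [(0.0, 0.0), (1000.0, 0.0), (1000.0, 1500.0), (4000.0, 1500.0)]
nn = nearest_neighbor_distances(suburb_nodes)

# Nine cut points splitting the distances into ten groups;
# the first cut point is the 10th percentile.
deciles = statistics.quantiles(nn, n=10, method="inclusive")
print("nearest-neighbor distances:", nn)
print("10th percentile:", deciles[0])
```

The brute-force search compares every pair of nodes, which is adequate for a list of fewer than a hundred suburbs per city.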

I queried Foursquare for food venues and for shop and service venues separately in each case and combined the results for each suburb. I kept only the venue types that were found in both Berlin and Cologne. For each city I removed outlier suburbs in the lower quartile of venue numbers. I then added the venue frequencies across the two cities and removed from the analysis outlier venue types that fell in the lower quartile of frequencies.
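The two filtering steps could look roughly like the sketch below. The venue counts are invented for illustration, the helper names are hypothetical, and treating "lower quartile" as "at or below the first quartile" is an assumption about where the cut-off falls.

```python
import statistics


def common_categories(counts_a, counts_b):
    """Venue categories that occur in at least one suburb of both cities."""
    seen_a = {cat for venues in counts_a.values() for cat in venues}
    seen_b = {cat for venues in counts_b.values() for cat in venues}
    return seen_a & seen_b


def drop_lower_quartile(totals):
    """Keep only the keys whose total lies strictly above the first quartile."""
    q1 = statistics.quantiles(totals.values(), n=4, method="inclusive")[0]
    return {key: value for key, value in totals.items() if value > q1}


# Invented venue counts per suburb and category (illustration only).
berlin = {"Mitte": {"Cafe": 12, "Bakery": 5}, "Kreuzberg": {"Cafe": 9, "Kiosk": 2}}
cologne = {"Ehrenfeld": {"Cafe": 7, "Bakery": 3}, "Nippes": {"Brewery": 4}}
common = common_categories(berlin, cologne)

# Invented total venue counts for five suburbs of one city.
venues_per_suburb = {"A": 2, "B": 10, "C": 20, "D": 30, "E": 40}
kept = drop_lower_quartile(venues_per_suburb)
print(sorted(common), sorted(kept))
```

The same `drop_lower_quartile` helper can be applied to the pooled venue-type frequencies, since both filters use the same quartile rule.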

My goal was to make a map of Cologne suburbs to Berlin suburbs by minimizing their dissimilarity based on the number and type of their venues. I quantified suburb dissimilarity by the normalized Euclidean distance in feature space. The normalization factor was simply the number of features entering the analysis, chosen to make the dissimilarity measure invariant to changes in this number.
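One reading of this measure is the plain Euclidean distance divided by the number of features. The sketch below implements that reading; whether the original analysis divided the distance itself or the sum of squares is an assumption, and the function name `dissimilarity` is hypothetical.

```python
import math


def dissimilarity(features_a, features_b):
    """Euclidean distance between two feature vectors, divided by the feature count."""
    n = len(features_a)
    if n == 0 or n != len(features_b):
        raise ValueError("feature vectors must be non-empty and of equal length")
    euclidean = math.sqrt(sum((a - b) ** 2 for a, b in zip(features_a, features_b)))
    return euclidean / n


# Two made-up suburbs described by four venue-frequency features.
print(dissimilarity([0.0, 0.0, 0.0, 0.0], [3.0, 4.0, 0.0, 0.0]))
```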

In order to make it easy for my friends, I reported the top two similar suburbs of Berlin for each of the suburbs of Cologne. Also, for each pair I reported the top similar and top dissimilar features these two suburbs had.

Results

I identified all 96 of the Berlin suburbs and all 86 of the Cologne suburbs using the OpenStreetMap API. Below I show snapshots of the city maps indicating the representative node for each suburb.

The next task was to decide on the Foursquare query radius (in meters) around the representative points of the suburbs for each city separately. For that, I converted the latitude/longitude coordinates to UTM and computed the Euclidean distance of each suburb to its nearest neighbor. I show below a visualization of these nearest-neighbor distances for the suburbs of the two cities; note that the zoom level may differ between the two maps.

The distributions of the nearest-neighbor distances for the two cities are visualized below. In the end, I set the query radius close to the first quartile of nearest-neighbor distances for each city: 1400 meters for Berlin and 1000 meters for Cologne.

Food, shops and services queries to Foursquare were done separately. I consulted the general categories provided by Foursquare and decided to use the following broad categories:

category          categoryId
food              4d4b7105d754a06374d81259
shop & service    4d4b7105d754a06378d81259
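For illustration, a query for one suburb node could be assembled as in the sketch below. The endpoint (`venues/search` of the Foursquare v2 API), the parameter selection, the version date, the credentials and the example coordinates are all assumptions; the report does not state which endpoint was actually used. Only the two category IDs and the 1400 meter Berlin radius are taken from the text. The sketch builds the request URL and does not send it.

```python
from urllib.parse import urlencode

# Category IDs as listed in the table above.
FOOD = "4d4b7105d754a06374d81259"
SHOP_AND_SERVICE = "4d4b7105d754a06378d81259"

# Assumed endpoint; the original analysis may have used a different one.
BASE_URL = "https://api.foursquare.com/v2/venues/search"


def build_query_url(lat, lon, radius_m, category_id,
                    client_id, client_secret, version="20200101", limit=50):
    """Build a venue search URL around one suburb node (no request is sent)."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,            # API version date (hypothetical value)
        "ll": f"{lat},{lon}",    # latitude,longitude of the suburb node
        "radius": radius_m,      # query radius in meters
        "categoryId": category_id,
        "limit": limit,
    }
    return BASE_URL + "?" + urlencode(params)


# Hypothetical node near the center of Berlin, placeholder credentials.
url = build_query_url(52.52, 13.405, 1400, FOOD, "MY_ID", "MY_SECRET")
print(url)
```

Running one such query per category and per suburb node, then merging the results, corresponds to the separate food and shop & service queries described above.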

I identified and kept for the analysis only the venue types common to the two cities. As expected, some remote suburbs of Berlin and Cologne did not turn up many venues. I show below the distribution of the number of venues per suburb for each city. In the end, I decided to remove suburbs in the bottom 25%. I was sure my friends would not be interested in those anyway.

Finally, I pooled the venue frequencies across all suburbs of one city with those of the other city to get a final estimate of feature frequency in my dataset, and I removed from further analysis the features in the bottom 25% of frequencies.

At this point I was ready for the main analysis. I standardized the features across suburbs in order to prevent the most frequent venue types, such as supermarkets, from dominating the analysis. Then, for each suburb of Cologne present in the analysis I computed its normalized Euclidean distance in feature space to all the Berlin suburbs present in the analysis. This resulted in a dissimilarity metric that was invariant to the number of features entering the analysis. Its range could vary. Values very close to zero indicated strongly similar suburbs. Values in the range of [0.1, 0.2] indicated somewhat similar suburbs, and larger values indicated dissimilar suburbs.
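Standardization of this kind is commonly done with z-scores, as in the minimal sketch below. It assumes a suburb-by-feature matrix pooled over both cities and uses the population standard deviation; both choices, the function name and the numbers are assumptions for illustration.

```python
import statistics


def standardize_columns(matrix):
    """Z-score each feature column: subtract its mean, divide by its standard deviation."""
    columns = list(zip(*matrix))
    means = [statistics.fmean(col) for col in columns]
    stdevs = [statistics.pstdev(col) for col in columns]
    return [
        # A constant column carries no information, so it is mapped to 0.0.
        [(v - m) / s if s > 0 else 0.0 for v, m, s in zip(row, means, stdevs)]
        for row in matrix
    ]


# Made-up counts: rows are suburbs, columns are (supermarkets, bakeries).
counts = [[10, 1], [20, 3], [30, 5]]
z = standardize_columns(counts)
print(z)
```

After standardization the frequent supermarket column and the rare bakery column have the same spread, so neither dominates the Euclidean distance.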

For each suburb of Cologne I picked the two most similar suburbs of Berlin, i.e. those that had the smallest dissimilarity metric. I also identified the top features where the suburbs were most similar and most dissimilar and included them in the final table map. In the table below, I show the final map of Cologne to Berlin suburbs that this analysis produced.
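The selection step can be sketched as follows. The feature values and suburb labels are invented, the helper names are hypothetical, and ranking features by the absolute difference of their standardized values is an assumed definition of "most similar" and "most dissimilar" features.

```python
import math


def dissimilarity(a, b):
    """Euclidean distance divided by the number of features (assumed normalization)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b))) / len(a)


def two_closest(target, candidates):
    """Return the two candidate suburbs with the smallest dissimilarity to the target."""
    scored = [(name, dissimilarity(target, vec)) for name, vec in candidates.items()]
    scored.sort(key=lambda pair: pair[1])
    return scored[:2]


def rank_features(a, b, names):
    """Feature names ordered from most similar to most dissimilar."""
    diffs = [abs(x - y) for x, y in zip(a, b)]
    return [name for _, name in sorted(zip(diffs, names))]


# Invented standardized features for one Cologne suburb and three Berlin suburbs.
feature_names = ["Bakery", "Cafe", "Supermarket"]
cologne_suburb = [0.5, -1.0, 0.2]
berlin_suburbs = {
    "suburb_x": [0.5, -0.7, 0.2],
    "suburb_y": [2.0, 1.0, -1.0],
    "suburb_z": [0.4, -1.0, 0.8],
}

best = two_closest(cologne_suburb, berlin_suburbs)
ranking = rank_features(cologne_suburb, berlin_suburbs[best[0][0]], feature_names)
print(best)
print(ranking)
```

The first entries of `ranking` are the features on which the pair agrees most closely and the last entries are those on which it differs most.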

On the one hand, there were Cologne suburbs very close in character to the Berlin ones. On the other hand, there were also suburbs which seemed to be unique in their Cologne character and their closest counterpart in Berlin was quite dissimilar. This can also be seen by the distribution of the dissimilarity metric shown below.

Discussion

I found the Cologne to Berlin suburb map very interesting. It turned out that the Cologne city center suburbs were very distant from all choices in Berlin and, vice versa, no Berlin city center suburb was found to be similar to any of the Cologne suburbs. City centers seem to have a unique, non-transferable character. It should be said, though, that the character of a suburb cannot be captured only by looking at food venues, shops and services. Nevertheless, this simple first approach to the problem gave my friends some broad-stroke ideas about the Berlin suburbs to get them going and help them narrow down their choices.

On the data science side of things, I can see many areas where this analysis could be expanded and improved. Firstly, I would move away from the Foursquare API and its limitations and switch completely to the OpenStreetMap API, where it is possible to query multipolygon areas for anything without limitations. That way the whole suburb area can be queried for features instead of a fixed radius around a node. Also, there is a wealth of information that one can add to the suburb character features that is not even available on Foursquare, such as the number of trees, the area of the suburb covered by greenery, public transportation density, etc. Although such an analysis would take me beyond the scope of this project, I still consider it an interesting project to work on in the future.

The question asked by my friends was a very specific one. I was tempted to use machine-learning methods like K-means clustering in order to identify groups of suburbs across the cities. However, such an analysis would not have answered the question that my friends asked. If anything, it would have left them more confused having to choose by themselves among the suburbs that co-clustered with a Cologne suburb. I chose a way of analysis that was tailored to the exact question asked: Can you make a map of the Cologne to Berlin suburbs for us?

Conclusions

A map of the Cologne to Berlin suburbs was made by looking at very basic information like food venues, shops and various services. Suburbs of Cologne were mapped to the two closest suburbs in Berlin, listing also the distance in Euclidean feature space as well as information about the top similar and top dissimilar features involved. The results were quite satisfactory. This analysis can serve as a blueprint for future work that can expand it and improve on the type of features used in order to capture more accurately the character of suburbs.